Another Two-Level Failure Recovery Scheme: Performance

Author

Nitin H. Vaidya

Abstract

This report deals with the design and evaluation of a "two-level" failure recovery scheme for distributed systems. In our previous work [30, 32], we motivated a "two-level" recovery approach that tolerates the more probable failures with a low overhead, and less probable failures with possibly higher overhead. The two-level approach can achieve a smaller overhead as compared to traditional recovery schemes. The contributions of this report are summarized below: We present and evaluate a "two-level" recovery scheme that is suitable for a network of workstations, each workstation having a local disk. The recovery scheme presented in the report can tolerate transient processor failures with a low overhead, while other failures require a larger overhead. The report presents analysis of the average (expected) task completion time using the proposed scheme. This scheme has been implemented on a workstation cluster. Our analysis indicates that the proposed two-level recovery scheme can achieve better performance as compared to existing "one-level" recovery schemes. The report also evaluates the impact of checkpoint latency on the performance of the recovery scheme. To our knowledge, no analysis of the performance impact of checkpoint latency has been carried out previously. Experimental measurements of checkpoint latency and checkpoint overhead for four applications are presented.

References [30, 32] present material related to this report. The interested reader can obtain these references via anonymous ftp from ftp.cs.tamu.edu:/pub/vaidya.

This report was revised several times in January 1995. The purpose of these revisions was to add Sections 10 and 11, and to revise Section 1.

Similar resources

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another promising method is in the algorithm level, called algorithmic recovery. These two methods can achieve high efficiency when the system scale is not very large, b...

Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads

Long-running applications are often subject to failures. When failures occur, they lead to unacceptable system overheads. Checkpointing is used to reduce the losses in the event of a failure. For the two-level checkpoint recovery scheme used in long-running tasks, it is unavoidable for the system to periodically transfer huge memory context to a remote stable storage. Therefo...

A Fast Rollback-Recovery Scheme based on Optimistic Message Logging

This paper presents an efficient rollback recovery scheme based on optimistic message logging. To speed up the recovery process, the rollback point of the failed process is broadcast and other processes asynchronously make the rollback decision based on the vector time. Asynchronous recovery process usually causes two possible problems: One is the message delivered from an invalid state inter...

An Efficient Rerouting Scheme for MPLS-Based Recovery and Its Performance Evaluation

Path recovery in MPLS is the technique of rerouting traffic around a failure or congestion in an LSP. Currently, there are two kinds of models for path recovery: rerouting and protection switching. The existing schemes based on the rerouting model have the disadvantage of more difficulty in handling node failures or concurrent node faults. Similarly, the existing schemes based on protection switchi...

A Case for Multi-Level Distributed Recovery Schemes

Most of the distributed recovery schemes proposed in the literature are designed to tolerate an arbitrary number of failures, with a few notable exceptions of schemes designed to tolerate single failures. In this report, we demonstrate that it is often advantageous to use "multi-level" recovery schemes. A "multi-level" recovery scheme is one that can tolerate different number of faults at different...

Publication date: 1994
